Comparison of observer based methods for source localisation in complex networks

In recent years, research on methods for locating the source of spreading phenomena in complex networks has seen numerous advances. Such methods can be applied not only to searching for the “patient zero” in epidemics, but also to finding the true sources of false or malicious messages circulating in online social networks. Many methods for solving this problem have been established and tested in various circumstances. Yet, we still lack reviews that include a direct comparison of the efficiency of these methods. In this paper, we provide a thorough comparison of several observer-based methods for source localisation on complex networks. All methods use information about the exact time of spread arrival at a pre-selected group of vertices called observers. We investigate how the precision of the studied methods depends on the network topology, density of observers, infection rate, and observers’ placement strategy. The direct comparison between methods allows for an informed choice of the methods for applications or further research. We find that the Pearson correlation based method and the method based on the analysis of multiple paths are the most effective in networks with synthetic or real topologies. The former method dominates when the infection rate is low; otherwise, the latter method takes over.


Computation time
Not all of the methods we use have their complexity stated clearly in the original papers, and those that do often rest on slightly different assumptions in their analysis. Here we therefore try to unify the complexity analysis for all the methods. Starting with LPTV, the authors of the original paper claim $O(N^3)$, assuming that building a breadth-first search (BFS) tree is $O(N^2)$ and that it is done for every node. However, there are two aspects to be addressed here. Firstly, the complexity of a BFS is $O(E + N)$ (where $E$ is the number of edges), and if the mean degree $\langle k \rangle \to \infty$ as $N \to \infty$ then it is indeed $O(N^2)$. Using the example of a full graph for simplicity, we get $E = N(N - 1)/2$, hence $O(N^2 + N) = O(N^2)$. On the other hand, if $\langle k \rangle \to \mathrm{const.}$ as $N \to \infty$ then we have $O(\langle k \rangle N + N) = O(N)$. Secondly, and in this case more importantly, since Pinto et al. assume the number of observers to stay constant, they neglect the complexity of the matrix inverse operations. Since we assume the density of the observers to be constant, we can no longer do that. The complexity of a matrix inverse varies depending on the method, but assuming Gaussian elimination, for an $n \times n$ matrix we have $O(n^3)$. In our case $n = d \cdot N$, where $d$ is the observer density. As such, the matrix inverse becomes $O(N^3)$, and since we need to do it for each node, the complexity of the LPTV method is in fact $O(N^4)$, regardless of how we compute the complexity of the BFS, as the matrix inverse becomes the dominant factor.
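The LPTV estimate above can be condensed into a few lines. The following is only a compact restatement of that argument; $T_{\mathrm{BFS}}$, $T_{\mathrm{inv}}$ and $T_{\mathrm{LPTV}}$ are shorthand introduced here (they do not appear in the original papers) for the per-node cost of the BFS tree, the per-node cost of the matrix inverse, and the total cost, respectively:

```latex
\begin{align}
T_{\mathrm{BFS}}  &= O(E + N) \subseteq O(N^2), \\
T_{\mathrm{inv}}  &= O(n^3) = O(d^3 N^3) = O(N^3) \quad \text{for constant } d, \\
T_{\mathrm{LPTV}} &= N \cdot \left( T_{\mathrm{BFS}} + T_{\mathrm{inv}} \right)
                   = O\!\left( N \cdot N^3 \right) = O(N^4).
\end{align}
```

Because $T_{\mathrm{inv}}$ dominates $T_{\mathrm{BFS}}$ in both the dense and the sparse case, the result does not depend on which of the two BFS bounds applies.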
For the related method -EPL, via a very similar reasoning we get $O(N^3)$. Again, the matrix inverse operation is the dominant factor; however, unlike in LPTV, we perform it only once and not for every node separately.
GMLA, also related to LPTV, is a gradient variant with a limited number of observers: it takes on the order of $\sqrt{N}$ observers (instead of $O(N)$) and computes the score for $\log(N)$ nodes, with the matrix inverse yet again making up the bulk of the computation, giving $O(N^{3/2} \log(N))$. Note that the $\sqrt{N}$ rule is not set in stone and can be chosen differently, thus changing the exponent in $N^{3/2}$.
TRBS and PC are very similar to each other, as their computation is dominated by finding shortest paths and computing an $O(K)$ measure on a set of $K$ observers (either a variance in the case of TRBS or a Pearson coefficient in PC). As such, for both of them we get either $O(N^3)$ if $\langle k \rangle \to \infty$ (see the analysis of LPTV above) or $O(N^2)$ otherwise. Do note that in their original paper Shen et al. provide $O(N^2 \log(N))$ without much explanation. We suspect this is due to the complexity of a weighted-graph shortest path algorithm (Dijkstra) being run for each node, giving $N \cdot N \log(N)$. Since here we do not consider a weighted variant of the localisation problem, we can use the BFS algorithm instead of Dijkstra.
2). The fits are made for networks of at least $10^3$ nodes, except for GMLA, for which the minimum network size is $10^4$ nodes. The numbers in parentheses show the 95% confidence interval. Simulations were conducted on a machine with the AMD® EPYC™ 7452 2.35GHz.
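The structure shared by TRBS and PC, as described above, is one BFS per candidate node followed by an $O(K)$ statistic over the observers. A minimal sketch of that structure is given below for a PC-like score. It assumes an unweighted, connected graph stored as an adjacency dictionary, and it assumes that the score is the Pearson correlation between the hop distances from a candidate to the observers and the observers' arrival times. The function names and the handling of degenerate candidates are illustrative choices, not the original authors' implementation.

```python
from collections import deque
from math import sqrt


def bfs_distances(adj, start):
    """Hop distances from `start` to every reachable node; O(E + N)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; O(K)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("-inf")  # assumption: degenerate candidates rank last
    return cov / sqrt(var_x * var_y)


def rank_candidates(adj, arrival_times):
    """Rank all nodes by a PC-like score, best candidate first."""
    observers = list(arrival_times)
    times = [arrival_times[o] for o in observers]
    scores = {}
    for candidate in adj:  # N candidates, one BFS each
        dist = bfs_distances(adj, candidate)
        distances = [dist[o] for o in observers]  # assumes a connected graph
        scores[candidate] = pearson(distances, times)
    return sorted(scores, key=scores.get, reverse=True)


# Toy path graph 0-1-2-3-4 with the source at node 1 and unit delays (illustrative only)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
arrival_times = {0: 1.0, 3: 2.0, 4: 3.0}
ranking = rank_candidates(adj, arrival_times)
```

The outer loop runs $N$ times and each iteration costs $O(E + N + K)$, which reproduces the $O(N^2)$ estimate for sparse graphs and $O(N^3)$ for dense ones.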

Location performance response to observer density
Intuitively, we would expect the precision to be a monotonically non-decreasing function of the observer density; however, in our experiments we observe several examples that break this intuition. See, e.g., Fig. S1, EPL with HVO for β = 0.5, which shows a downward trajectory from density d = 0.1 to d = 0.15, or Fig. S3, PC with HVO for β = 0.2, densities d = 0.15 and d = 0.2 (this list is not exhaustive).
In this section we propose three possible explanations for this, although in all honesty we are unable to answer conclusively why this happens.
Firstly, it is possible that our statistics are insufficient. Despite conducting tens of thousands of simulations, the complexity of the problem may be such that orders of magnitude more are required to reach a stable relation between precision and observer density. To test this, we took 70 different realisations of a Barabási-Albert (BA) graph, simulated 100 Susceptible-Infected (SI) cascades on each of them, attempted to find the source, and analysed each BA instance separately. All of these data combined are what is presented in the main text and in the other sections of this supplementary information. In Fig. S13 we show box plots for each of these instances for different infection rates (β) for the aforementioned downward transition for EPL and HVO, i.e., d = 0.15. What can be seen is that the results can indeed vary widely between instances of the same graph type, even for a relatively high β. Additionally, to exclude the effects of insufficient statistics, we used Kruskal's test for the variant EPL+HVO, β = 0.5 (see Fig. S1), and found that we can reject the hypothesis of the precision for d = 0.1 and d = 0.15 being equal at the significance level α = 0.025. Moreover, Welch's two-sample t-test shows that we can reject this hypothesis at the significance level α = 0.008482 in favour of the alternative hypothesis that the precision for d = 0.1 is indeed higher than for d = 0.15. One can inspect this visually in Fig. S14, which shows box plots of precision as a function of the density for EPL and HVO.
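For readers who wish to repeat the second comparison above, the statistic behind Welch's two-sample t-test can be sketched as follows. This is a minimal, standard-library illustration, not the script used in the study: the function name `welch_t` is hypothetical, and the two samples are placeholder numbers invented for the example, not precision values from our simulations.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two means (variances are not pooled)
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, dof


# Placeholder precision values, NOT data from the study
precision_d010 = [0.42, 0.38, 0.45, 0.40, 0.44]
precision_d015 = [0.35, 0.37, 0.33, 0.39, 0.34]
t_stat, dof = welch_t(precision_d010, precision_d015)
```

A positive `t_stat` points towards the one-sided alternative that the first sample has the higher mean; the p-value then follows from Student's t distribution with `dof` degrees of freedom, which a statistics package can provide (for instance, `scipy.stats.ttest_ind` with `equal_var=False` and `alternative="greater"`).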
Secondly, HVO is a rather particular placement strategy in that the observer sets for different density levels are not necessarily correlated. What we mean here is that usually, on a given graph, the best observer set for density $d_1$ is part of the best observer set for $d_2 > d_1$. This is true for all placement methods except HVO, where the observer sets for two different densities can be completely different, and thus the response in precision can appear non-smooth.
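The nesting property described above can be made concrete with a small sketch. It assumes a placement strategy that selects the best-ranked nodes from a fixed ranking, which yields nested observer sets by construction, and contrasts it with sets that are re-optimised independently for each density. The ranking, the example sets and the helper names are all hypothetical and serve only to illustrate the distinction.

```python
def top_fraction(ranking, density):
    """Observer set made of the best-ranked nodes for a given density."""
    count = round(density * len(ranking))
    return set(ranking[:count])


def is_nested(observer_sets):
    """True if each set (ordered by increasing density) contains the previous one."""
    return all(small <= large for small, large in zip(observer_sets, observer_sets[1:]))


ranking = list(range(20))  # hypothetical node ranking, best first
nested_sets = [top_fraction(ranking, d) for d in (0.1, 0.15, 0.2)]

# Hypothetical sets chosen independently for each density
independent_sets = [{0, 1}, {4, 7, 9}, {1, 2, 5, 8}]

print(is_nested(nested_sets))       # True
print(is_nested(independent_sets))  # False
```

With nested sets, raising the density only adds information to what the smaller set already provides, whereas independently chosen sets carry no such guarantee.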
Thirdly, the potential statistical variance effects described above cannot explain the discussed phenomenon in the case of real-world networks. When we use such a network, it is fixed for every SI realisation, and thus no such variance exists. Similarly, the decreasing trend can also appear for other placement methods (see, e.g., Fig. S3, β = 0.2, PC+HCR); therefore, the particular nature of HVO is also not sufficient to explain this behaviour. As such, we are forced to conclude that precision is not necessarily a monotonic, non-decreasing function of the observer density, and that predicting what exactly can affect the performance of the source location estimator is a very complex and, most surely, still understudied problem.
Bars within minor blocks show all methods, ordered from the best to the worst (a high precision indicates a high performance), with the background colour of the minor block indicating the best localisation method. The asterisk indicates the best localisation and placement strategy combination per row within a major block, i.e., for a given density, topology and infection rate. Bars are normalised to the highest score per graph, infection rate, and density.
Figure S20. Comparison of 0.95-CSS summary diagrams for network models (Barabási-Albert and Erdős-Rényi) with average degrees 8 and 16. The colours indicate the localisation methods, whereas observer placement strategies are marked per (minor) column in each block (labels are placed at the very bottom of the plot), while minor rows represent observer densities d. Bars within minor blocks show all methods, ordered from the best to the worst (a high precision indicates a high performance), with the background colour of the minor block indicating the best localisation method. The asterisk indicates the best localisation and placement strategy combination per row within a major block, i.e., for a given density, topology and infection rate. Bars are normalised to the highest score per graph, infection rate, and density.
Figure S25. Summary diagrams of precision metric results for synthetic networks (major columns) and infection rates β (major rows). The colours indicate the localisation methods, whereas observer placement strategies are marked per (minor) column in each block (labels are placed at the very bottom of the plot), while minor rows represent the average degree in a network. Bars within minor blocks show all methods, ordered from the best to the worst (a high precision indicates a high performance), with the background colour of the minor block indicating the best localisation method. The asterisk indicates the best localisation and placement strategy combination per row within a major block, i.e., for a given density, topology and infection rate. Bars are normalised to the highest score per graph, infection rate, and density.
Figure S26. Summary diagrams of 0.95-CSS metric results for synthetic networks (major columns) and infection rates β (major rows). The colours indicate the localisation methods, whereas observer placement strategies are marked per (minor) column in each block (labels are placed at the very bottom of the plot), while minor rows represent the average degree in a network. Bars within minor blocks show all methods, ordered from the best to the worst (a high precision indicates a high performance), with the background colour of the minor block indicating the best localisation method. The asterisk indicates the best localisation and placement strategy combination per row within a major block, i.e., for a given density, topology and infection rate. Bars are normalised to the highest score per graph, infection rate, and density.

Results for a denser grid of infection rate and density of observers 10%